Abstract

This paper proposes an online transfer framework to capture the interaction among agents
and shows that current transfer learning in reinforcement learning is a special case of
online transfer. Furthermore, this paper re-characterizes existing agents-teaching-agents
methods as online transfer and analyzes one such teaching method in three ways. First, the
convergence of Q-learning and Sarsa with tabular representation and a finite budget is
proven. Second, the convergence of Q-learning and Sarsa with linear function approximation
is established. Third, we show that the asymptotic performance cannot be hurt through
teaching. Additionally, all theoretical results are empirically validated.


Introduction 


Agents can autonomously learn to master sequential decision tasks by reinforcement
learning Sutton and Barto (1998). Traditionally, reinforcement learning agents are trained
and used in isolation. More recently, the reinforcement learning community became
interested in interaction among agents to improve learning.


There are many possible methods to assist an agent's learning Erez and Smart (2008);
Taylor and Stone (2009). This paper focuses on action advice Torrey and Taylor (2013):
as the student agent practices, the teacher agent suggests actions to take. This method
requires only agreement on the action sets between teachers and students, while allowing
for different state representations and different learning algorithms among teachers
and students.

Although this advice method has been shown empirically to provide multiple benefits
Torrey and Taylor (2013); Zimmer et al. (2014), existing work does not provide a formal
understanding of teaching or advice. Therefore, this paper proposes an online transfer
framework to characterize the interaction among agents, aiming to understand teaching
or advice from the transfer learning perspective. We extend the transfer learning
framework in reinforcement learning proposed by Lazaric (2012) into online transfer
learning, which captures the online interaction between agents. Also, we show that
1) transfer learning is a special case of the online transfer framework, and 2) our
framework is similar to that of active learning Settles (2010), but in a reinforcement
learning setting.

After introducing our novel framework, we use it to analyze existing advice
methods, such as action advice Torrey and Taylor (2013). First, we prove the convergence
of Q-learning and Sarsa with tabular representation with a finite amount of advice.
Second, the convergence of Sarsa and Q-learning with linear function approximation
is established with finite advice. Here, convergence means that the algorithms converge
to the optimal Q-value. Third, we show that a finite amount of advice cannot
change the student’s asymptotic performance. These three results are then confirmed
empirically in a simple Linear Chain MDP and a more complex Pac-Man simulation.


Background 

This section provides necessary background, adopting some notation introduced elsewhere
Sutton and Barto (1998); Melo et al. (2008).


Markov Decision Process 


Let $M = (S, A, P, R, \gamma)$ be a Markov decision process (MDP) with a compact state
set $S$ and a finite action set $A$. $P$ is the transition probability kernel. For any
triplet $(s, a, s') \in S \times A \times S$, the probability of a transition from state $s$
to state $s'$ when taking action $a$ is defined as
$\mathbb{P}[s' \in U \mid s, a] = P(U \mid s, a)$, where $U$ is a Borel-measurable subset¹ of
$S$. $R : S \times A \times S \to \mathbb{R}$ is a bounded deterministic function which assigns
a reward $R(s, a, s')$ to the transition from state $s$ to state $s'$ taking action $a$. The
discount factor is $\gamma$ such that $0 < \gamma < 1$. The expected total discounted reward
for $M$ under some policy can be defined as
$E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where $t$ is the time step and
$r(s_t, a_t)$ denotes the reward received for taking action $a_t$ in state $s_t$ at time step
$t$, according to the reward distribution. For convenience, we omit the state and the action
and only use $r_t$ to denote the reward received at time step $t$, so the expected total
discounted reward can be written as $E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$.
$r(s, a)$ and $R(s, a, s')$ have the following relationship:
$E[r(s, a)] = \int_S R(s, a, s') P(ds' \mid s, a)$.

A policy specifies which action to take in each state. A deterministic
policy $\pi$ is a mapping defined as $\pi : S \to A$, while a stochastic policy is a mapping
defined over $S \times A$, i.e., $\mathbb{P}[\text{choose action } a \mid \text{at state } s] = \pi(s, a)$.

The state-action value function is the expected return for a state-action pair under a
given policy: $Q^{\pi}(s, a) = E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a\right]$.
Solving an MDP usually means finding an optimal policy that maximizes the expected return.
An optimal policy $\pi^*$ is such that $Q^{\pi^*}(s, a) \ge Q^{\pi}(s, a)$ for all $s \in S$,
all $a \in A$ and all policies $\pi$. We can define the optimal state-action value function
$Q^*$ as
$Q^*(s, a) = \int_S \left(R(s, a, s') + \gamma \max_{a' \in A} Q^*(s', a')\right) P(ds' \mid s, a)$,
representing the expected total discounted reward received along an optimal trajectory when
taking action $a$ in state $s$ and following the optimal policy $\pi^*$ thereafter. For all
$s \in S$, $\pi^*(s) = \arg\max_{a \in A} Q^*(s, a)$. Notice that although a stochastic policy
may be optimal, there will always be a deterministic optimal policy with at least as high
an expected value.

1 Details on Borel-measurable subsets can be found elsewhere Rudin (1986).


Q-learning and Sarsa 


Q-learning is an important learning algorithm in reinforcement learning. It is a model-free,
off-policy learning algorithm that was a breakthrough in reinforcement learning control.
Watkins (1989) introduced Q-learning as follows:

Given any estimate $Q_0$, the Q-learning algorithm can be represented by the following
update rule:

$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha_t(s, a) \Delta_t \tag{1}$$

where $Q_t$ denotes the estimate of $Q^*$ at time $t$, $\{\alpha_t(s, a)\}$ denotes the
step-size sequence and $\Delta_t$ denotes the temporal difference at time $t$,

$$\Delta_t = r_t + \gamma \max_{a' \in A} Q_t(s', a') - Q_t(s, a) \tag{2}$$

where $r_t$ is the reward received at time step $t$. The update in Equation (2) does not
depend on any policy, so Q-learning is called an off-policy algorithm.

In contrast to off-policy algorithms, there are on-policy algorithms, among which
Sarsa is the analogue of Q-learning Rummery and Niranjan (1994). Given any estimate
$Q_0$ and a policy $\pi$, the difference between Q-learning and Sarsa lies in the temporal
difference $\Delta_t$:

$$\Delta_t = r_t + \gamma Q_t(s', a') - Q_t(s, a) \tag{3}$$

where $a'$ is determined by the policy $\pi$ and $r_t$ is the reward received at time step $t$.
Notice that the action selection in Equation (3) involves the policy $\pi$, making it on-policy.
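The two temporal differences in Equations (2) and (3) can be sketched side by side for the tabular case. This is a minimal illustration, not the authors' implementation: the dictionary-based Q-table and the function names `q_learning_update` and `sarsa_update` are assumptions made for the example.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy step, Equations (1)-(2): bootstrap from the best next action."""
    delta = r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy step, Equations (1) and (3): bootstrap from the action the policy chose."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta

# Hypothetical two-action example: the only difference is which next value is used.
Q1 = defaultdict(float)
Q1[(1, "right")] = 2.0
d_q = q_learning_update(Q1, 0, "right", -1.0, 1, ["left", "right"], alpha=0.5, gamma=0.8)

Q2 = defaultdict(float)
Q2[(1, "right")] = 2.0
d_s = sarsa_update(Q2, 0, "right", -1.0, 1, "left", alpha=0.5, gamma=0.8)
```

With the same transition, Q-learning bootstraps from the larger value stored for `(1, "right")`, while Sarsa uses the value of the action `"left"` that the (hypothetical) policy selected.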

If both $S$ and $A$ are finite sets, the Q-value function can be easily represented by an
$|S| \times |A|$ matrix, which can be stored in a computer as a table. This matrix
representation is also called tabular representation. In this case, the convergence of
Q-learning, Sarsa, and other related algorithms (such as TD($\lambda$)) has been shown by
previous work Peter (1992); Watkins and Dayan (1992); Singh et al. (2000). However, if $S$
or $A$ is infinite or very large, it is infeasible to use tabular representation and a compact
representation is required (i.e., function approximation). This paper focuses on Q-learning
with linear function approximation and Sarsa with linear function approximation. Linear
approximation means that the state-action value function $Q$ can be represented by a linear
combination of features $\{\phi_i\}_{i=1}^{d}$, where $\phi_i : S \times A \to \mathbb{R}$ is a
feature and $d$ is the number of features. Given a state $s \in S$ and an action $a \in A$, the
action value at time step $t$ is defined as

$$Q_t(s, a) = \sum_{i=1}^{d} \theta_t(i) \phi_i(s, a) = \theta_t^{\top} \phi(s, a) \tag{4}$$

where $\theta_t$ and $\phi$ are $d$-dimensional column vectors and $\top$ denotes the transpose
operator.
Since $\phi$ is fixed, algorithms are only able to update $\theta_t$ at each step.
Gradient-descent methods are among the most widely used of all function approximation
methods. Applying a gradient-descent method to Equation (1), we obtain approximate Q-learning:

$$\theta_{t+1} = \theta_t + \alpha_t(s, a) \nabla_{\theta} Q_t(s, a) \Delta_t
= \theta_t + \alpha_t(s, a) \phi(s, a) \Delta_t \tag{5}$$

where $\{\alpha_t\}$ is the step-size sequence and $\Delta_t$ is the temporal difference at time
step $t$ (Equation (2)). Similarly, given a policy $\pi$, the on-policy temporal difference can
be defined as

$$\Delta_t = r_t + \gamma Q_t(s', a') - Q_t(s, a)
= r_t + \gamma \theta_t^{\top} \phi(s', a') - \theta_t^{\top} \phi(s, a) \tag{6}$$

where $a'$ is determined by the policy $\pi$ at time $t$. Combining Equation (5) and Equation (6),
we obtain Sarsa with linear approximation. For a fixed set of features
$\{\phi_i : S \times A \to \mathbb{R}\}$, our goal is to learn a parameter vector $\theta^*$ such
that $\theta^{*\top} \phi(s, a)$ approximates the optimal Q-value $Q^*$.
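As a hedged sketch of the parameter updates with linear function approximation, the following pure-Python functions apply the gradient step to a parameter list `theta`. The helper names and the list-based feature vectors are assumptions for illustration; the Q-learning variant uses the temporal difference of Equation (2) and the Sarsa variant uses Equation (6).

```python
def q_value(theta, phi):
    """Equation (4): Q(s, a) = theta^T phi(s, a)."""
    return sum(t * f for t, f in zip(theta, phi))

def linear_q_learning_step(theta, phi_sa, r, next_phis, alpha, gamma):
    """Equation (5) with the off-policy temporal difference of Equation (2).

    next_phis holds phi(s', a') for every action a' in A (an assumed layout).
    """
    delta = r + gamma * max(q_value(theta, p) for p in next_phis) - q_value(theta, phi_sa)
    return [t + alpha * delta * f for t, f in zip(theta, phi_sa)]

def linear_sarsa_step(theta, phi_sa, r, phi_next, alpha, gamma):
    """Equation (5) with the on-policy temporal difference of Equation (6)."""
    delta = r + gamma * q_value(theta, phi_next) - q_value(theta, phi_sa)
    return [t + alpha * delta * f for t, f in zip(theta, phi_sa)]

# Hypothetical two-feature example.
theta_q = linear_q_learning_step([0.0, 0.0], [1.0, 0.0], 1.0,
                                 [[0.0, 1.0], [1.0, 0.0]], alpha=0.5, gamma=0.9)
theta_s = linear_sarsa_step([1.0, 2.0], [1.0, 0.0], 0.0, [0.0, 1.0], alpha=0.1, gamma=0.5)
```

Only the components of `theta` whose features are non-zero for the visited pair change, which is the sense in which the update follows the gradient of Equation (4).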


Online Transfer Framework 


This section introduces a framework for online transfer learning in reinforcement learning
domains, inspired by previous work Lazaric (2012).


Online Transfer 

Transfer learning is a technique that leverages past knowledge in one or more source
tasks to improve the learning performance of a learning algorithm in a target task.
Therefore, the key is to describe the knowledge transferred between different algorithms.
A standard reinforcement learning algorithm usually takes as input some raw
knowledge of the task and returns a solution from a set of possible solutions. We use
$\mathcal{K}$ to denote the space of possible input knowledge for learning algorithms and
$\mathcal{H}$ to denote the space of hypotheses (possible solutions, e.g., policies and value
functions). Specifically, $\mathcal{K}$ refers to all the necessary input information for
computing a solution of a task, e.g., samples, features and learning rate.

In general, the objective of transfer learning is to reduce the need for samples from
the target task by taking advantage of prior knowledge. An online transfer learning
algorithm can be defined by a sequence of transferring and learning phases, e.g.,
1) transferring knowledge, 2) learning, 3) transferring based on previous learning, 4)
learning, etc. Let $\mathcal{K}_s^L$ be the knowledge from $L$ source tasks, $\mathcal{K}_t^i$ be
the knowledge collected from the target task at time $i$ and $\mathcal{K}_{learn}^i$ be the
knowledge obtained from the learning algorithm at time $i$ (including previous learning
phases). We define one time step as a one-step update in a learning algorithm or one batch
update in a batch learning algorithm. Thus, the algorithm may transfer one-step knowledge,
or one-episode knowledge, or even one-task or multi-task knowledge to the learner, depending
on the setting. $\mathcal{K}^i$ denotes the knowledge space with respect to time $i$ such that
$\mathcal{K}^i \subseteq \mathcal{K}$ for all $i = 0, 1, 2, \ldots$. The online transfer learning
algorithm can be defined as

$$\mathcal{A}_{transfer} : \mathcal{K}_s^L \times \mathcal{K}_t^i \times \mathcal{K}_{learn}^i
\to \mathcal{K}_{transfer}^i \tag{7}$$

where $\mathcal{K}_{transfer}^i$ denotes the knowledge transferred to the learning phase at time
$i$, $i = 0, 1, 2, \ldots$. Notice that $\mathcal{K}_{learn}^i$ is generated by the learning
algorithm. Thus, the reinforcement learning algorithm can be formally described as

$$\mathcal{A}_{learn} : \mathcal{K}_{transfer}^i \times \mathcal{K}_t^i
\to \mathcal{K}_{learn}^{i+1} \times \mathcal{H}^{i+1} \tag{8}$$

where $\mathcal{K}_{learn}^{i+1}$ is the knowledge from the learning algorithm at time $i + 1$ and
$\mathcal{H}^{i+1}$ is the hypothesis space at time $i + 1$, $i = 0, 1, 2, \ldots$.
$\mathcal{K}_{learn}^{i+1}$ is used as input for the next time step in the online transfer
Equation (7). Then, $\mathcal{A}_{transfer}$ generates the transferred knowledge
$\mathcal{K}_{transfer}^{i+1}$ for the learning phase in Equation (8), $\mathcal{A}_{learn}$
computes $\mathcal{K}_{learn}^{i+2}$ for the next time step, and so on. In practice, the initial
knowledge from the learning phase, $\mathcal{K}_{learn}^0$, can be empty or any default value.
In this framework, we expect the hypothesis space sequence $\mathcal{H}^0, \mathcal{H}^1, \ldots$
to become better and better over time under some criteria (e.g., the maximum average reward
or the maximum discounted reward), where $\mathcal{H}^i \subseteq \mathcal{H}$ is the space of
hypotheses with respect to $i$, $i = 0, 1, 2, \ldots$, that is, the space of possible solutions
at time $i$. See Figure 1 for an illustration.
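The alternation between the transfer phase (Equation (7)) and the learning phase (Equation (8)) can be sketched as a simple loop. This is only an illustration under stated assumptions: `online_transfer`, `toy_transfer`, `toy_learn` and the representation of knowledge as plain Python values are hypothetical and not part of the framework's definition.

```python
def online_transfer(transfer, learn, k_source, sample_target, steps):
    """Alternate the transfer phase (Eq. 7) and the learning phase (Eq. 8)."""
    k_learn, hypothesis = None, None              # K_learn^0 may be empty
    for i in range(steps):
        k_target = sample_target(i)               # K_t^i
        k_transfer = transfer(k_source, k_target, k_learn)   # Eq. (7)
        k_learn, hypothesis = learn(k_transfer, k_target)    # Eq. (8)
    return hypothesis

# Toy instantiation (hypothetical): record what the transfer phase sees each round.
trace = []

def toy_transfer(k_source, k_target, k_learn):
    trace.append(("transfer", k_learn))
    return (k_source, k_target)

def toy_learn(k_transfer, k_target):
    # The learner's new knowledge is the latest target sample; the hypothesis is a label.
    return k_target, f"h{k_target + 1}"

final = online_transfer(toy_transfer, toy_learn, "src", lambda i: i, steps=3)
```

The trace shows that the knowledge produced by the learner at step $i$ is exactly what the transfer phase receives at step $i + 1$, which is the feedback loop that distinguishes online transfer from a single transfer step.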

Example 1. Consider the Active Relocation Model Mihalkova and Mooney (2006).
In this setting, there is an expert and a learner, which can be treated as the transfer
algorithm $\mathcal{A}_{transfer}$ and the learning algorithm $\mathcal{A}_{learn}$, respectively.
The learner is able to relocate its current state to a visited state, but the learner may
become stuck in a sub-optimal state. Thus, the expert is able to help the learner to relocate
its current state to a better state according to the expert’s knowledge. This algorithm can
be represented in our framework as $\mathcal{K}_s = (S \times A \times S \times R)^{N_s}$, where
$N_s$ is the number of samples the expert collects from the source tasks,
$\mathcal{K}_t^i = (S_i \times A_i \times S_i \times R_i)^{N_i}$,
$\mathcal{K}_{learn}^i = (Q_i \times S_i \times A_i)$, $\mathcal{K}_{transfer}^i = (S_i)$ and
$\mathcal{H}^{i+1} = \{Q_{i+1}\}$, $i = 0, 1, \ldots$.²

Although we explicitly introduce $\mathcal{K}_t^i$ and $\mathcal{K}_{learn}^i$ in Equations (7)
and (8), in most settings it is impossible for the transfer algorithm and the learning
algorithm to explicitly access the knowledge from target tasks, or they have only limited
access to it. For example, communication failures and restrictions may cause these problems.


Transfer Learning and Online Transfer Learning 


Our online transfer learning framework can be treated as an online extension of the
transfer learning framework Lazaric (2012). If we set all $\mathcal{K}_{learn}^i = \emptyset$
and set $i = 0$, we have

$$\mathcal{A}_{transfer} : \mathcal{K}_s^L \times \mathcal{K}_t^0 \times \emptyset
\to \mathcal{K}_{transfer}^0 \tag{9}$$

$$\mathcal{A}_{learn} : \mathcal{K}_{transfer}^0 \times \mathcal{K}_t^0 \to \mathcal{H}^1 \tag{10}$$

where $\mathcal{A}_{transfer}$ transfers the knowledge to $\mathcal{A}_{learn}$ once, returning
to the classic transfer learning scenario.

Figure 1: (Top) The standard learning process only requires original knowledge from
the target tasks. (Bottom) In the online transfer learning process, the transfer algorithm
takes as input knowledge from the source tasks, the target task and the learner at time i and
outputs transfer knowledge at time i; the learning algorithm then takes the transfer knowledge
at time i to generate a hypothesis at time i + 1. This process repeats until a good
hypothesis is computed.

2 $S_i$, $A_i$, $R_i$ are all subsets of the set $S$ of states, the set $A$ of actions, and the
set $R$ of rewards, respectively. $Q_i$ is the estimate of the Q-value function at time $i$. We
introduce the index to distinguish between different time steps. For example, the learning
algorithm is able to reach more states at time step $i + 1$ than at time step $i$. Thus,
$S_i \subseteq S_{i+1}$.


Advice Model with Budget 


Now we discuss an advice method from previous work Torrey and Taylor (2013), a concrete
implementation of online transfer learning. Suppose that the teacher has learned
an effective policy $\pi_t$ for a given task. Using this fixed policy, it will teach students
beginning to learn the same task. As the student learns, the teacher will observe each
state $s$ the student encounters and each action $a$ the student takes. Having a budget of
$B$ pieces of advice, the teacher can choose to advise the student in $n \le B$ of these states
to take the “correct” action $\pi_t(s)$.

Torrey and Taylor (2013) assumed the teacher’s action advice is always
correct and that students were required to execute suggested actions. Suppose that a
reinforcement learning teacher agent $T$ is trained in a task and has access to its learned
Q-value function $Q_t$. Then, a student agent $S$ begins training in the task and is able
to accept advice in the form of actions from the teacher. We use the notation $(T, S, \pi_d)$
to denote the advice model, where $T$ is the teacher agent, $S$ is the student agent and
$\pi_d$ is the policy that the teacher uses to provide its advice to the student. The following
example illustrates how to characterize this advice model in the context of our online
transfer learning framework.
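A budgeted teacher of the kind described above can be sketched as a small class. The class name `BudgetedTeacher` and the rule of spending advice only when the student's intended action differs from the teacher's are assumptions for illustration (the latter mirrors the mistake-correcting idea), not a definitive reproduction of the cited method.

```python
class BudgetedTeacher:
    """Teacher with a fixed policy pi_t and a budget of B pieces of advice."""

    def __init__(self, policy, budget):
        self.policy = policy        # pi_t: maps a state to the advised action
        self.remaining = budget     # advice left, starts at B

    def advise(self, state, student_action):
        """Return the action the student executes in `state`.

        Advice is spent only when the student's intended action differs from
        the teacher's (assumed mistake-correcting rule); once the budget is
        exhausted the student always acts on its own.
        """
        if self.remaining <= 0:
            return student_action
        advice = self.policy(state)
        if advice != student_action:
            self.remaining -= 1
            return advice           # the student is required to follow advice
        return student_action

# Hypothetical usage: a teacher that always advises "right", with B = 2.
teacher = BudgetedTeacher(policy=lambda s: "right", budget=2)
executed = [teacher.advise(s, a) for s, a in
            [(0, "left"), (1, "right"), (2, "left"), (3, "left")]]
```

In this run the teacher corrects the first and third choices, leaves the second untouched because it already matches, and has no budget left for the fourth.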


Example 2. Let us consider the advice model using the Mistake Correcting approach with
limited budget and linear function approximation Torrey and Taylor (2013). In this
model, there is a teacher and a student, which can be treated as the transfer algorithm
and the learning algorithm, respectively. First, the transfer algorithm
$\mathcal{A}_{transfer}$ collects $N_s$ samples from $L$ source tasks. Then, it will return an
advice action $a$ to the learning algorithm $\mathcal{A}_{learn}$ according to the current state
and the action observed from the learning algorithm (initial knowledge is empty). The
learning algorithm $\mathcal{A}_{learn}$ takes the advice action $a$ and samples from the target
task and returns a state and an action for the next step. Meanwhile, $\mathcal{A}_{learn}$
maintains a function in the space spanned by the features $\{\phi_j\}_{j=1}^{d}$, where
$\phi_j : S \times A \to \mathbb{R}$ is defined by a domain expert. Moreover, the teacher has a
limited budget $n$ for advising the student, so the time step $i = 0, 1, \ldots, n$. Therefore,
we have $\mathcal{K}_s = (S \times A \times S \times R)^{N_s}$,
$\mathcal{K}_t^i = (S_i \times A_i \times S_i \times R_i)^{N_i}$,
$\mathcal{K}_{learn}^i = (S_i \times A_i)$, $\mathcal{K}_{transfer}^i = (A)$ and
$\mathcal{H}^{i+1} = \{f(\cdot, \cdot) = \sum_{j=1}^{d} \theta_{i+1}(j) \phi_j(\cdot, \cdot)\}$,
$i = 0, 1, \ldots, n$.


Theoretical Analysis 

For the advice model $(T, S, \pi_d)$ considered in this paper, the most important theoretical
problem is to establish the convergence of the algorithms, since it guarantees their
correctness. In the next subsection, we discuss how action advice interacts
with the tabular versions of Q-learning and Sarsa. Afterwards, the corresponding algorithms
with linear function approximation are discussed.


Tabular Representation 


The convergence of Q(0) (Q-learning) has been established by many works Watkins
and Dayan (1992); Jaakkola et al. (1994); Tsitsiklis (1994); Mohri et al. (2012).


Lemma 1. (Mohri et al. (2012), Theorem 14.9, page 332) Let $M$ be a finite MDP.
Suppose that for all $s \in S$ and $a \in A$, the step-size sequence $\{\alpha_t(s_t, a_t)\}$
satisfies

$$\sum_t \alpha_t(s_t, a_t) = \infty, \qquad \sum_t \alpha_t(s_t, a_t)^2 < \infty.$$

Then, the Q-learning algorithm converges with probability 1.

Notice that the conditions on $\alpha_t(s_t, a_t)$ ensure that every state-action pair is
visited infinitely many times.
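As a quick numerical illustration of the two step-size conditions, consider the schedule $\alpha_t = 1/t$ (an example chosen here, not a schedule prescribed by the lemma). Its partial sums grow without bound while the partial sums of its squares stay below $\pi^2/6$:

```python
import math

BOUND = math.pi ** 2 / 6  # limit of the sum of 1/t^2

def partial_sums(n):
    """Return (sum of alpha_t, sum of alpha_t^2) for alpha_t = 1/t, t = 1..n."""
    s1 = sum(1.0 / t for t in range(1, n + 1))
    s2 = sum(1.0 / (t * t) for t in range(1, n + 1))
    return s1, s2

for n in (10, 1000, 100000):
    s1, s2 = partial_sums(n)
    # s1 keeps growing roughly like log(n); s2 stays below BOUND.
    print(n, s1 > math.log(n), s2 < BOUND)
```

The first sum exceeds $\log n$ for every $n$, so it diverges, while the second is bounded, which is exactly the pair of requirements in Lemma 1.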

Theorem 1. Given an advice model $(T, S, \pi_d)$, if the student $S$ adopts the Q-learning
algorithm and the conditions in Lemma 1 all hold, then the convergence of Q-learning still
holds in the advice model setting.

Proof. Notice that the conditions on $\alpha_t(s_t, a_t)$ ensure that each state-action pair is
visited infinitely many times. Since there is only a finite amount of advice in our advice
model, the advice changes only finitely many action choices. Therefore, the assumptions still
hold in the advice model setting. Applying Lemma 1, the convergence result follows. □


Compared to Q-learning, Sarsa is an on-policy algorithm which requires a learning
policy to update the Q-values. Singh et al. (2000) prove that Sarsa with a GLIE policy
converges. We use their result to prove the convergence of Sarsa in the advice model. First
of all, we need to define a GLIE policy.

Definition 1. A decaying policy $\pi$ is called a GLIE (greedy in the limit with infinite
exploration) policy if it satisfies the following two conditions:

• each state-action pair is visited infinitely many times;

• in the limit, the policy is greedy with respect to the Q-value function with probability 1.

It is not hard to verify that the Boltzmann exploration policy satisfies the above two
conditions. We now state the result from Singh et al. (2000).

Lemma 2. (Singh et al. (2000)) Let $M$ be a finite MDP and $\pi$ a GLIE policy. If the
step-size sequence $\{\alpha_t(s_t, a_t)\}$ satisfies

$$\sum_t \alpha_t(s_t, a_t) = \infty, \qquad \sum_t \alpha_t(s_t, a_t)^2 < \infty,$$

then the Sarsa algorithm converges with probability 1.

Proof. Singh et al. (2000) prove a similar convergence result under a weaker assumption:
they assume that $\mathrm{Var}(r(s, a)) < \infty$. In this paper, we assume that $r(s, a)$ is
bounded, that is, $|r(s, a)| < \infty$ for all $(s, a)$ pairs, which implies
$\mathrm{Var}(r(s, a)) < \infty$. □
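The Boltzmann exploration policy mentioned before Lemma 2 can be sketched as follows. The function name and the temperature values are illustrative assumptions; whether a particular temperature schedule yields a GLIE policy depends on how slowly it decays, which this sketch does not address.

```python
import math

def boltzmann_probs(q_values, temperature):
    """Action probabilities proportional to exp(Q(s, a) / temperature)."""
    m = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

q = [1.0, 2.0]
warm = boltzmann_probs(q, temperature=100.0)  # close to uniform: exploration
mild = boltzmann_probs(q, temperature=1.0)
cold = boltzmann_probs(q, temperature=0.01)   # close to greedy
```

At a high temperature both actions are chosen with nearly equal probability, and as the temperature approaches zero almost all probability mass moves to the greedy action, which is the "greedy in the limit" part of Definition 1.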


Theorem 2. Given an advice model $(T, S, \pi_d)$, if the student $S$ adopts the Sarsa
algorithm and the conditions in Lemma 2 all hold, then the convergence of Sarsa still holds
in the advice model setting.

Proof. Notice that the GLIE policy guarantees that each state-action pair is visited
infinitely many times. Since there is only a finite amount of advice in our advice model,
the assumptions still hold in the advice model setting. Applying Lemma 2, the convergence
result follows. □


Remark 1. On the one hand, the key to the convergence results is that each state-action
pair is visited infinitely often. For an advice model, the finite budget does not invalidate
the infinite visit assumption. Therefore, the previous convergence results continue to
hold. On the other hand, the infinite visit assumption is only a sufficient condition
for the convergence result: if the assumption does not hold, convergence may still
hold. Moreover, the algorithms converge even if the budget is infinite, as long as the
student is still able to visit all state-action pairs infinitely many times.


Linear Function Approximation 

In the previous subsection, we discussed results for tabular learning algorithms, which
require an MDP with finitely many states and finitely many actions at each state.
However, infinite or large state-action spaces are very important in practice since they
can characterize many realistic scenarios.

The convergence of Q-learning and Sarsa with linear approximation in the standard
setting has been proved Melo et al. (2008), provided the relevant assumptions hold.
Our approach is inspired by this work: we assume that the algorithm (Q-learning or
Sarsa, with linear approximation) satisfies the convergence conditions in Melo et al.
(2008). We then apply the convergence theorems to the action advice model $(T, S, \pi_d)$,
and the results follow.

We need to define some notation to simplify our proofs. Given an MDP
$M = (S, A, P, R, \gamma)$ with a compact state set $S$ and a fixed policy $\pi$,
$\mathcal{M}_{\pi} = (S, P_{\pi})$ is the Markov chain induced by policy $\pi$. Assume that the
chain $\mathcal{M}_{\pi}$ is uniformly ergodic with invariant probability measure $\mu_S$ over
$S$ and that the policy $\pi$ satisfies $\pi(s, a) > 0$ for all $a \in A$ and all $s \in S$ with
non-zero $\mu_S$ measure.³ Let $\mu_{\pi}$ be the probability measure defined, for all
Borel-measurable sets $U \subseteq S$ and all actions $a \in A$, by

$$\mu_{\pi}(U \times \{a\}) = \int_U \pi(s, a) \, \mu_S(ds).$$

Suppose that $\{\phi_i\}_{i=1}^{d}$ is a set of bounded, linearly independent features. We
define the matrix $\Sigma_{\pi}$ as

$$\Sigma_{\pi} = E\left[\phi(s, a) \phi^{\top}(s, a)\right]
= \int_{S \times A} \phi(s, a) \phi^{\top}(s, a) \, d\mu_{\pi}.$$

Notice that $\Sigma_{\pi}$ is independent of the initial probability distribution due to
uniform ergodicity.

For a fixed $\theta \in \mathbb{R}^d$, $d \ge 1$, and a fixed state $s \in S$, define the set of
optimal actions in state $s$ as

$$A_{\theta, s} = \left\{ a^* \in A \;\middle|\; \theta^{\top} \phi(s, a^*) = \max_{a \in A} \theta^{\top} \phi(s, a) \right\}.$$

A policy $\pi$ is greedy w.r.t. $\theta$ if it assigns positive probability only to actions in
$A_{\theta, s}$. We define the $\theta$-dependent matrix

$$\Sigma^*_{\pi}(\theta) = E_{\pi}\left[\phi(s, a_{\theta, s}) \phi^{\top}(s, a_{\theta, s})\right],$$

where $a_{\theta, s}$ is a random action determined by policy $\pi$ at state $s$ in the set
$A_{\theta, s}$. Notice that the difference between $\Sigma_{\pi}$ and $\Sigma^*_{\pi}(\theta)$
is that the actions are taken according to $\pi$ in $\Sigma_{\pi}$, while in
$\Sigma^*_{\pi}(\theta)$ they are taken greedily w.r.t. a fixed $\theta$, that is, from the
actions in $A_{\theta, s}$.
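For a finite action set, the set of greedy actions $A_{\theta, s}$ can be computed directly from its definition. The sketch below is illustrative only: the `features` callable, the tolerance `tol` used to compare floating-point values, and the three-action example are assumptions.

```python
def greedy_action_set(theta, features, state, actions, tol=1e-12):
    """Actions maximizing theta^T phi(state, a); ties within `tol` are all kept."""
    values = {a: sum(t * f for t, f in zip(theta, features(state, a)))
              for a in actions}
    best = max(values.values())
    return [a for a in actions if best - values[a] <= tol]

# Hypothetical features: "right" and "stay" share a feature vector, so they tie.
def toy_features(state, action):
    return [1.0, 0.0] if action == "left" else [0.0, 1.0]

greedy = greedy_action_set([0.2, 0.7], toy_features, state=0,
                           actions=["left", "right", "stay"])
```

A policy that is greedy w.r.t. $\theta$ may then place its probability mass on any distribution over the returned actions, here `"right"` and `"stay"`.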

We first show that Q-learning with linear function approximation still converges in
the advice model setting. We introduce the following lemma:

Lemma 3. (Melo et al. (2008), Theorem 1) Let $M$, $\pi$ and $\{\phi_i\}_{i=1}^{d}$ be defined as
above. If, for all $\theta$, $\Sigma_{\pi} > \gamma^2 \Sigma^*_{\pi}(\theta)$ and the step-size
sequence $\{\alpha_t(s_t, a_t)\}$ satisfies

$$\sum_t \alpha_t(s_t, a_t) = \infty, \qquad \sum_t \alpha_t(s_t, a_t)^2 < \infty,$$

then Q-learning with linear approximation converges with probability 1.

Theorem 3. Given an advice model $(T, S, \pi_d)$, if the Markov chain induced
by $\pi_d$ is also uniformly ergodic, the student $S$ adopts Q-learning with linear
approximation, and the conditions in Lemma 3 all hold, then the convergence of Q-learning with
linear approximation still holds in the advice model setting.

Proof. Applying Lemma 3, the convergence result still holds in the advice model setting. □

3 This condition can be interpreted as the continuous counterpart of the "infinite visit"
condition in the finite state-action space scenario.


Next, we analyze the convergence of Sarsa with linear approximation in the
advice model. Since Sarsa is an on-policy algorithm, we need some different assumptions. A
policy $\pi$ is $\epsilon$-greedy with respect to a Q-value function $Q$ for a fixed $\theta$ if
it chooses a random action with probability $\epsilon > 0$ and a greedy action
$a \in A_{\theta, s}$ otherwise, for all states $s \in S$. A $\theta$-dependent policy
$\pi_{\theta}$ satisfies $\pi_{\theta}(s, a) > 0$ for all $\theta$. Now we consider a policy
$\pi_{\theta_t}$ that is $\epsilon$-greedy with respect to $\theta_t^{\top} \phi(s, a)$ at each
time step $t$ and Lipschitz continuous with respect to $\theta$, where $K$ denotes the
Lipschitz constant (with respect to a specific metric⁴). Moreover, we assume that the induced
Markov chain $\mathcal{M}_{\theta} = (S, P_{\theta})$ is uniformly ergodic.


Lemma 4. (Melo et al. (2008), Theorem 2) Let $M$, $\pi_{\theta_t}$ and $\{\phi_i\}_{i=1}^{d}$ be
defined as above. Let $K$ be the Lipschitz constant of the learning policy $\pi_{\theta}$ w.r.t.
$\theta$. If the step-size sequence $\{\alpha_t(s_t, a_t)\}$ satisfies

$$\sum_t \alpha_t(s_t, a_t) = \infty, \qquad \sum_t \alpha_t(s_t, a_t)^2 < \infty,$$

then there is $K_0 > 0$ such that, if $K < K_0$, Sarsa with linear approximation
converges with probability 1.

Theorem 4. Given an advice model $(T, S, \pi_d)$, if $\pi_d$ is $\theta$-dependent and
$\epsilon$-greedy w.r.t. a fixed $\theta_t$ at each time step $t$, the student $S$ adopts Sarsa
with linear approximation, and the conditions in Lemma 4 all hold, then Sarsa with linear
approximation still converges with probability 1.

Proof. Applying Lemma 4, the convergence result still holds in the advice model setting. □


Remark 2. Notice that we assume the budget of the teacher is finite, which implies that
a finite amount of advice from any teacher policy does not affect the convergence results as
long as the conditions in Lemmas 1, 2, 3 and 4 hold. Therefore, the student will eventually
converge even if the teacher is sub-optimal.


Asymptotic Performance 

Next, we investigate the asymptotic behavior in the advice model. Most convergence
results rely on infinite experience, which is not available in practice, so we first
redefine the concept of convergence.

Definition 2 (Convergence in Algorithm Design). If an algorithm $\mathfrak{A}$ converges, then
there exists an $N \in \mathbb{N}$ such that, for all $t > N$,

$$\|Q_{t+1} - Q_t\|_{\infty} \le \epsilon,$$

where $\epsilon$ is a very small constant.

Theorem 5. If an algorithm $\mathfrak{A}$ converges in terms of Definition 2, then finite advice
cannot improve the asymptotic performance of algorithm $\mathfrak{A}$.

Proof. If an algorithm $\mathfrak{A}$ converges, then there is an $N \in \mathbb{N}$ such that,
for all $t > N$,

$$\|Q_{t+1} - Q_t\|_{\infty} \le \epsilon,$$

where $\epsilon$ is a very small constant. Therefore, even if the advice is sub-optimal, the
student will always find the optimal action according to its own Q-value after $N$ updates,
that is, finite advice cannot affect the asymptotic performance in the sense of an infinite
horizon. The asymptotic performance is determined by the algorithm that the student uses, not
the advice provided by a teacher. □

4 Given two metric spaces $(X, d_X)$ and $(Y, d_Y)$, where $d_X$ and $d_Y$ denote the metrics on
the sets $X$ and $Y$, respectively, a function $f : X \to Y$ is called Lipschitz continuous if
there exists a real constant $K \ge 0$ such that for all $x_1, x_2 \in X$,

$$d_Y(f(x_1), f(x_2)) \le K d_X(x_1, x_2),$$

where the constant $K$ is called the Lipschitz constant.
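The stopping criterion in Definition 2 is easy to state in code for a tabular Q-function. The sketch below checks the sup-norm condition for a single pair of consecutive estimates; the dictionary representation and the function names are assumptions, and Definition 2 additionally requires the condition to hold for every $t > N$, which a single check does not establish.

```python
def sup_norm_diff(q_new, q_old):
    """Sup-norm distance between two Q-tables stored as dicts with the same keys."""
    return max(abs(q_new[k] - q_old[k]) for k in q_new)

def within_tolerance(q_new, q_old, eps):
    """One-step version of the condition ||Q_{t+1} - Q_t||_inf <= eps."""
    return sup_norm_diff(q_new, q_old) <= eps

# Hypothetical consecutive estimates for one state with two actions.
q_t = {("s", "a"): 1.0, ("s", "b"): 2.0}
q_t1 = {("s", "a"): 1.0005, ("s", "b"): 2.0}
loose = within_tolerance(q_t1, q_t, eps=1e-3)
tight = within_tolerance(q_t1, q_t, eps=1e-4)
```

In practice one would monitor this quantity over a window of consecutive updates rather than a single step.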

Remark 3. Theorem 5 indicates the limitation of the advice model. Generally, there
are two intuitive methods to improve the performance of the student in the advice model:
(1) a larger amount of advice, or (2) redistribution of the advice (e.g., delaying the advice
until it is most useful). Our theorem points out that, with a finite budget for advice,
the asymptotic performance is still determined by the algorithm that the student adopts,
as long as the algorithm converges. Furthermore, the benefit of delaying advice is also
limited by the convergence of the algorithm that the student uses.


Experimental Domain and Results 

In this section, we present experimental results in two domains. The goal of the
experiments is to provide experimental support for the convergence proofs from the
previous section, as well as to show that action advice improves learning. The first
domain is a simple linear chain of states: Linear Chain. The second is Pac-Man, a
well-known arcade game. We apply tabular Q-learning to the Linear
Chain and Sarsa with linear function approximation to Pac-Man.


Linear Chain MDP 


The first experimental domain is the Linear Chain MDP Lagoudakis and Parr (2003). In
this domain, we adopt Q-learning with the tabular representation to store the Q-values
due to its simplicity. See Figure 3 for details.

In this paper, the MDP has 50 states and two actions for each state: left and right.
State 0 is the start state and state 49 is the final state, in which the episode terminates.
The agent receives a reward of −1 per step in non-terminal states and 0 in the goal state.

To smooth the variance in student performance, we average 300 independent trials
of student learning. Each Linear Chain teacher is given an advice budget of $n = 1000$.
The reinforcement learning parameters of the students are $\epsilon = 0.1$, $\alpha = 0.9$ and
$\gamma = 0.8$.

We use four experimental settings to demonstrate the convergence results:

• Optimal Teacher: The teacher will always give the optimal action in each state,
i.e., move right.

• Random Teacher: The teacher gives random action advice: 50% move left and 50%
move right.

• Poor Teacher: The teacher always gives the worst action, i.e., move left.

• No Advice: There is no advice, equivalent to normal reinforcement learning.

Figure 2: Top: Q-learning students in the Linear Chain MDP. Bottom: Sarsa students in the
Pac-Man domain.

Group             FR        FR STD    TR           TR STD
Optimal Teacher   -53.99    3.30      -29007.01    1384.31
Random Teacher    -54.56    3.59      -41670.98    2398.87
Poor Teacher      -54.28    3.58      -43964.97    2394.78
No Advice         -54.13    3.221     -42355.24    2660.75

Table 1: FR is the final reward of the last episode, FR STD is the final reward’s standard
deviation, TR is the total reward accumulated over all episodes, and TR STD is
the standard deviation of the total reward.

Figure 3: Linear Chain MDP with 50 states. State 0 is the start state and state 49 is the
goal state.

Figure 2 (top) shows the results of these experiments (note the log scale on the
y-axis). All settings converge after 280 episodes of training despite different teacher
performance.

To compare methods, we calculate the area under each learning curve. We apply a
one-way ANOVA to test the difference between all settings, and the result shows that
$p < 2 \times 10^{-16}$, indicating that we should reject the null hypothesis that “all test
groups have same means.” Therefore, the experimental settings are statistically different,
where the optimal teacher outperforms the random teacher, which outperforms no
advice, which outperforms the poor teacher. Also, we provide the final reward, the standard
deviation of the final reward, the total reward and the standard deviation of the total
reward in Table 1.
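The Linear Chain setup can be sketched as a small simulation. This is a simplified illustration, not the code used for the reported results: the deterministic transitions, the rule that advice is given on every step until the budget runs out, the per-episode step cap `max_steps`, and all function names are assumptions made here.

```python
import random

N_STATES, GOAL = 50, 49
ACTIONS = (0, 1)  # 0 = left, 1 = right

def step(state, action):
    """Assumed deterministic chain: left at state 0 stays put; reward -1 per step."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    return nxt, -1.0, nxt == GOAL

def run(teacher, budget=1000, episodes=300, eps=0.1, alpha=0.9, gamma=0.8,
        max_steps=5000, seed=0):
    """Tabular Q-learning student; returns the undiscounted return of each episode."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    returns = []
    for _ in range(episodes):
        s, total = 0, 0.0
        for _ in range(max_steps):
            if teacher is not None and budget > 0:
                a = teacher(s, rng)          # the student must follow the advice
                budget -= 1
            elif rng.random() < eps:
                a = rng.choice(ACTIONS)      # epsilon-greedy exploration
            else:
                best = max(Q[s])
                a = rng.choice([b for b in ACTIONS if Q[s][b] == best])
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])   # Equations (1)-(2)
            s, total = s2, total + r
            if done:
                break
        returns.append(total)
    return returns

TEACHERS = {
    "optimal": lambda s, rng: 1,
    "random": lambda s, rng: rng.choice(ACTIONS),
    "poor": lambda s, rng: 0,
    "none": None,
}
```

Calling `run(TEACHERS[name])` for each setting gives one learning curve per teacher; averaging over many seeds would be needed before comparing them, as in the experiments above.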


Pac-Man


Pac-Man is a famous 1980s arcade game in which the player navigates a maze, trying
to earn points by touching edible items and trying to avoid being caught by the four
ghosts. We use a Java implementation of the game provided by the Ms. Pac-Man vs.
Ghosts League Rohlfshagen and Lucas (2011). This domain is discrete but has a very
large state space due to the different position combinations of the player and all ghosts,
so linear function approximation is used to represent the state. Student agents learn the
task using Sarsa and a state representation defined by 7 features that count objects at a
range of distances, as used (and defined) in Torrey and Taylor (2013).

Group             FR        FR STD    TR           TR STD
Correct Teacher   3746.75   192.18    341790.99    5936.23
Random Teacher    3649.78   167.86    313151.06    4634.88
Poor Teacher      3775.13   148.34    307926.03    7708.45
No Advice         3766.58   132.41    318072.70    7660.44

Table 2: FR, FR STD, TR, and TR STD are defined as in Table 1.


To smooth the natural variance in student performance, each learning curve averages
30 independent trials of student learning. While training, an agent pauses every few
episodes to perform at least 30 evaluation episodes and records its average
performance; the graphs therefore show the performance of students when they are
(1) not learning and (2) not receiving advice.
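
The evaluation protocol can be sketched as follows. This is a hedged illustration only: `train_episode` and `evaluate_episode` are hypothetical callables standing in for the student's training loop and for an evaluation run with learning and advice switched off, and the evaluation interval of 10 episodes is an assumption, since the text says only "every few episodes".

```python
def train_with_evaluation(train_episode, evaluate_episode, n_episodes,
                          eval_every=10, n_eval=30):
    """Train, pausing periodically to record average evaluation performance.

    train_episode():    runs one training episode (learning and advice on).
    evaluate_episode(): runs one episode with learning and advice off and
                        returns its score.
    Returns a list of (episode, mean evaluation score) points.
    """
    curve = []
    for episode in range(1, n_episodes + 1):
        train_episode()
        if episode % eval_every == 0:
            scores = [evaluate_episode() for _ in range(n_eval)]
            curve.append((episode, sum(scores) / n_eval))
    return curve

def average_curves(curves):
    """Average the curves of several independent trials point by point."""
    n = len(curves)
    return [(points[0][0], sum(p[1] for p in points) / n)
            for points in zip(*curves)]
```

Under these assumptions, calling `train_with_evaluation` once per trial and passing the 30 resulting curves to `average_curves` would give one averaged learning curve per teaching condition.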

Each Pac-Man teacher is given an advice budget of $n = 1000$, which is half the step
limit of a single episode. The reinforcement learning parameters of the students are
$\epsilon = 0.05$, $\alpha = 0.001$, and $\gamma = 0.999$.
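
To make the student's learning rule concrete, here is a minimal sketch of Sarsa with linear function approximation using the parameter values reported above. It rests on assumptions not confirmed by this section: the 7 features are taken to be action-dependent, so that a single shared weight vector gives $Q(s, a) = w^\top \phi(s, a)$, and the function names are hypothetical.

```python
import random

EPSILON, ALPHA, GAMMA = 0.05, 0.001, 0.999   # values reported in the text
N_FEATURES = 7

def q_value(weights, features):
    """Linear estimate Q(s, a) = w . phi(s, a)."""
    return sum(w * f for w, f in zip(weights, features))

def epsilon_greedy(weights, features_per_action, rng):
    """Pick an action index: random with probability EPSILON, else greedy."""
    n_actions = len(features_per_action)
    if rng.random() < EPSILON:
        return rng.randrange(n_actions)
    return max(range(n_actions),
               key=lambda a: q_value(weights, features_per_action[a]))

def sarsa_update(weights, features, reward, next_features, done):
    """One Sarsa step: w <- w + alpha * td_error * phi(s, a).

    next_features is phi(s', a') for the action actually chosen in s'.
    """
    next_q = 0.0 if done else q_value(weights, next_features)
    td_error = reward + GAMMA * next_q - q_value(weights, features)
    return [w + ALPHA * td_error * f for w, f in zip(weights, features)]
```

When a teacher advises an action, the student would simply execute it in place of the epsilon-greedy choice, and the update itself is unchanged.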

To demonstrate that finite advice cannot affect the convergence of students, we
adopt the following experimental settings:

• Correct Teacher: Provide the (near-)optimal action when it observes that the
student is about to execute a sub-optimal action.

• Random Teacher: Provide a random action suggestion from the set of legal moves.

• Poor Teacher: Advise the student to take the action with the lowest Q-value
whenever the student is about to execute a sub-optimal action.

• No Advice: There is no advice, equivalent to normal reinforcement learning. 
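
The four settings above can be sketched as a single budgeted teacher. This is an illustrative sketch under stated assumptions rather than the authors' code: `teacher_q` is a hypothetical callable returning the teacher's own Q-value for a state-action pair, "sub-optimal" is judged against the teacher's greedy action, and the random teacher is assumed to advise on every step while budget remains.

```python
import random

class BudgetedTeacher:
    """Gives action advice until a finite budget is exhausted."""

    def __init__(self, kind, teacher_q=None, budget=1000, rng=None):
        self.kind = kind                # "correct", "random", "poor", "none"
        self.teacher_q = teacher_q      # hypothetical: (state, action) -> float
        self.budget = budget            # n = 1000 in the text
        self.rng = rng or random.Random()

    def advise(self, state, legal_actions, intended_action):
        """Return an advised action, or None if no advice is given."""
        if self.kind == "none" or self.budget <= 0:
            return None
        if self.kind == "random":
            advice = self.rng.choice(legal_actions)
        else:
            best = max(legal_actions,
                       key=lambda a: self.teacher_q(state, a))
            if intended_action == best:
                return None             # student is already acting well
            if self.kind == "correct":
                advice = best
            else:                       # "poor": lowest Q-value action
                advice = min(legal_actions,
                             key=lambda a: self.teacher_q(state, a))
        self.budget -= 1                # advice given, spend one unit
        return advice
```

Once `budget` reaches zero, `advise` always returns `None`, so from that point on all four conditions reduce to the no-advice setting.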

The experimental results are shown in Figure 2 (bottom). All settings converge after
900 training episodes, regardless of teacher performance. As before, a one-way ANOVA
is used to test the total reward accumulated under the four teaching conditions. The
result, $p < 4.6 \times 10^{-13}$, shows that the experimental settings differ
significantly: the correct teacher was better than no advice, which was better than
the random teacher, which was better than the poor teacher. The rewards are reported
in Table 2.


Related Work 


This section briefly outlines related work in transfer learning in reinforcement
learning domains, online transfer learning in supervised learning, and algorithmic
teaching.

Transfer learning in reinforcement learning domains has been studied recently, Taylor
and Stone (2009); Lazaric (2012). Lazaric introduces a transfer learning framework,
which inspired our online transfer learning framework, and classifies transfer
learning in reinforcement learning domains into three categories: instance transfer,
representation transfer, and parameter transfer, Lazaric (2012). The action advice
model is a method of instance transfer because it relies on explicit action advice
(i.e., sample transfer). Lazaric proposed an instance-transfer method that selectively
transfers samples on the basis of the similarity between source and target tasks,
Lazaric et al. (2008).

Azar et al. (2013) introduced a model that takes the teacher/advice model as input,
and in which a reinforcement learning algorithm can query the input advice policy as
necessary. However, their model does not consider the behavior of the learning
algorithm, which we believe is important in online reinforcement learning.

Zhao and Hoi propose an online transfer learning framework in supervised learning,
Zhao and Hoi (2010), aiming to transfer useful knowledge from a source domain to an
online learning task on a target domain. They introduce a framework to solve transfer
in two different settings. In the first, source tasks share the same domain as target
tasks; in the second, the source domain and the target domain are different.

Finally, a branch of computational learning theory called algorithmic teaching tries
to understand teaching in theoretical ways, Balbach and Zeugmann (2009). In
algorithmic learning theory, the teacher usually determines an example sequence and
teaches the sequence to the learner. There are many algorithmic teaching models, such
as the teaching dimension, Goldman and Kearns (1995), and teaching learners with
restricted mind changes, Balbach and Zeugmann (2005). However, those models still
concentrate on supervised learning. Cakmak and Lopes (2012) developed a teaching
method based on algorithmic teaching, but their work focuses on computing a one-time
optimal teaching sequence, which lacks the online setting.


Discussion 


This paper proposes an online transfer learning framework. It then characterizes two
existing works addressing teaching in reinforcement learning. A theoretical analysis
of one of the methods, in which teachers provide action advice, leads us to the
following conclusions. First, Q-learning and Sarsa converge to the optimal Q-value
when there is a finite amount of advice. Second, with linear function approximation,
Q-learning and Sarsa converge to the optimal Q-value, provided the standard
assumptions hold. Third, there is a limit to the advice model: teacher advice cannot
affect the asymptotic performance of any algorithm that converges. Fourth, our results
are empirically validated in the Linear Chain MDP and in Pac-Man.

In the future, sample complexity and regret analysis for the advice model will be
investigated, now that the convergence results have been established. Additional
models under the online transfer framework will be developed, which will not only
focus on interaction between machines, but also consider interaction between machines
and humans (e.g., learning from demonstration, Argall et al. (2009)). Finally, we will
consider other reinforcement learning algorithms such as R-Max and study the
theoretical properties of those algorithms in the presence of the advice model.